Negative hypergeometric distribution (nhypergeom)#
The negative hypergeometric distribution models sequential sampling without replacement from a finite population.
SciPy’s parameterization (used in this notebook):
A population has \(M\) total items: \(n\) red and \(M-n\) blue.
Draw items one-by-one without replacement until you have drawn \(r\) blue items.
The random variable is
$$ X = \text{\# of red items drawn when the } r\text{-th blue item appears}. $$
Learning goals#
Classify the distribution and understand the support/constraints.
Write the PMF and CDF (and connect them to the hypergeometric distribution).
Compute moments (mean/variance/skewness/kurtosis) and interpret parameters.
Derive expectation/variance and write a practical log-likelihood.
Implement NumPy-only simulation algorithms.
Visualize PMF/CDF and validate with Monte Carlo.
Use scipy.stats.nhypergeom and scipy.stats.fit.
Prerequisites#
Basic probability (PMF, CDF, expectation)
Combinatorics (binomial coefficients)
Familiarity with sampling without replacement (hypergeometric distribution)
Notebook roadmap#
Title & Classification
Intuition & Motivation
Formal Definition
Moments & Properties
Parameter Interpretation
Derivations (Expectation, Variance, Likelihood)
Sampling & Simulation (NumPy-only)
Visualization (PMF, CDF, Monte Carlo)
SciPy Integration
Statistical Use Cases
Pitfalls
Summary
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from scipy import special, stats
from scipy.special import logsumexp, xlogy
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 7
rng = np.random.default_rng(SEED)
import sys
import scipy
import plotly
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("SciPy:", scipy.__version__)
print("Plotly:", plotly.__version__)
print("Seed:", SEED)
Python: 3.12.9
NumPy: 1.26.2
SciPy: 1.15.0
Plotly: 6.5.2
Seed: 7
1) Title & Classification#
Name: nhypergeom (negative hypergeometric distribution)
Type: Discrete
Support: \(k \in \{0,1,\dots,n\}\)
Parameter space (SciPy):
\(M \in \{1,2,\dots\}\) (population size)
\(n \in \{0,1,\dots,M\}\) (number of red items)
\(r \in \{0,1,\dots,M-n\}\) (number of blue items to draw before stopping)
Notation:
\(X \sim \mathrm{NHG}(M,n,r)\) is common in texts.
In SciPy:
stats.nhypergeom(M, n, r).
Non-degenerate case: typically \(r\ge 1\) and \(0<n<M\).
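A quick sanity check of the support; a sketch with illustrative parameters (20, 7, 3):

```python
# Sketch: ask a frozen SciPy distribution for its support; (20, 7, 3) is illustrative.
dist = stats.nhypergeom(20, 7, 3)
print(dist.support())  # expected (0, 7): k ranges over {0, ..., n}
```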
2) Intuition & Motivation#
What this distribution models#
You have a finite population with two types (red/blue). You sample without replacement until you have observed r blue items. The distribution describes how many red items you saw by that stopping time.
This is an “inverse” version of the hypergeometric setting:
Hypergeometric: fix the number of draws, count how many reds you see.
Negative hypergeometric (nhypergeom): fix the number of blues you must see, count how many reds you see before that happens.
Typical real-world use cases#
Quality control / audit sampling: inspect items from a finite lot until you find
rdefects (blue); how many non-defects (red) did you see?Reliability / screening: test units without replacement until
rfailures are found; count the passes.Games / cards: draw from a deck until you see
rcards of a certain category; count another category.
Relations to other distributions#
Hypergeometric connection: the PMF can be written using a hypergeometric PMF on the first \(k+r-1\) draws and the probability the \((k+r)\)-th draw is blue.
Negative binomial limit: as \(M\to\infty\) with \(n/M\to p\) fixed (sampling becomes “with replacement”), nhypergeom approaches a negative binomial distribution for the number of red successes before r blue failures (checked numerically below).
Beta-binomial identity: nhypergeom(M, n, r) is equivalent to a beta-binomial distribution with a particular parameter mapping (used below for moments).
3) Formal Definition#
PMF#
Let \(M\) be the population size, with \(n\) red and \(M-n\) blue items. Sample without replacement until \(r\) blue items have been drawn.
For \(k \in \{0,1,\dots,n\}\), the PMF is
$$ \Pr(X=k) = \frac{\binom{k+r-1}{k}\binom{M-r-k}{n-k}}{\binom{M}{n}}. $$
A useful interpretation:
In the first \(k+r-1\) draws, you must have seen exactly \(k\) reds (and \(r-1\) blues).
The next draw (the \((k+r)\)-th) must be blue.
SciPy states the relationship to the hypergeometric PMF as
$$ \mathrm{NHG}(k;M,n,r) = \mathrm{HG}(k;M,n,k+r-1)\,\frac{M-n-(r-1)}{M-(k+r-1)}. $$
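This identity can be checked numerically against scipy.stats.hypergeom; a sketch with illustrative parameters:

```python
# Sketch: verify NHG(k) = HG(k; M, n, k+r-1) * (M-n-(r-1)) / (M-(k+r-1)).
M_, n_, r_ = 40, 12, 5
k = np.arange(0, n_ + 1)
lhs = stats.nhypergeom.pmf(k, M_, n_, r_)
rhs = stats.hypergeom.pmf(k, M_, n_, k + r_ - 1) * (M_ - n_ - (r_ - 1)) / (M_ - (k + r_ - 1))
print(np.allclose(lhs, rhs))  # expected: True
```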
CDF#
For integer \(k\),
$$ F(k) = \Pr(X \le k) = \sum_{j=0}^{k} \Pr(X=j). $$
There is also a helpful hypergeometric view: by time \(k+r\) you have drawn \(k+r\) items; having stopped by then is equivalent to having seen at least \(r\) blues, i.e. at most \(k\) reds. If \(Y \sim \mathrm{Hypergeom}(M, n, k+r)\) counts reds in the first \(k+r\) draws, then
$$ F(k) = \Pr(Y \le k). $$
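The same kind of numerical check works for this CDF view; a sketch with illustrative parameters:

```python
# Sketch: F(k) for NHG(M, n, r) should equal P(Y <= k) with Y ~ Hypergeom(M, n, k+r).
M_, n_, r_ = 40, 12, 5
k = np.arange(0, n_ + 1)
print(np.allclose(stats.nhypergeom.cdf(k, M_, n_, r_),
                  stats.hypergeom.cdf(k, M_, n_, k + r_)))  # expected: True
```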
def validate_nhypergeom_params(M: int, n: int, r: int):
M = int(M)
n = int(n)
r = int(r)
if M <= 0:
raise ValueError(f"M must be a positive integer, got {M!r}")
if not (0 <= n <= M):
raise ValueError(f"n must satisfy 0 <= n <= M, got n={n!r}, M={M!r}")
if not (0 <= r <= M - n):
raise ValueError(
f"r must satisfy 0 <= r <= M-n, got r={r!r}, M={M!r}, n={n!r}"
)
return M, n, r
def _is_integral(x: np.ndarray) -> np.ndarray:
x = np.asarray(x)
return np.isfinite(x) & (x == np.floor(x))
def log_choose(N, K):
"""log binomial coefficient log(C(N,K)) with broadcasting.
Returns -inf for invalid K.
"""
N = np.asarray(N)
K = np.asarray(K)
N, K = np.broadcast_arrays(N, K)
out = np.full(N.shape, -np.inf, dtype=float)
valid = (K >= 0) & (K <= N)
out[valid] = (
special.gammaln(N[valid] + 1)
- special.gammaln(K[valid] + 1)
- special.gammaln(N[valid] - K[valid] + 1)
)
return out
def nhypergeom_logpmf(k, M: int, n: int, r: int):
"""Log PMF of nhypergeom in SciPy's parameterization (vectorized in k)."""
M, n, r = validate_nhypergeom_params(M, n, r)
k = np.asarray(k)
# Handle degenerate r=0: stop immediately, so P(X=0)=1.
if r == 0:
return np.where(k == 0, 0.0, -np.inf).astype(float)
integral = _is_integral(k)
k_int = np.floor(k).astype(int)
valid_k = integral & (0 <= k_int) & (k_int <= n)
logp = np.full(k.shape, -np.inf, dtype=float)
if np.any(valid_k):
kv = k_int[valid_k]
log_num1 = log_choose(kv + r - 1, kv)
log_num2 = log_choose(M - r - kv, n - kv)
log_den = log_choose(M, n)
logp[valid_k] = log_num1 + log_num2 - log_den
return logp
def nhypergeom_pmf(k, M: int, n: int, r: int):
return np.exp(nhypergeom_logpmf(k, M=M, n=n, r=r))
def nhypergeom_pmf_array(M: int, n: int, r: int):
M, n, r = validate_nhypergeom_params(M, n, r)
ks = np.arange(0, n + 1)
pmf = nhypergeom_pmf(ks, M=M, n=n, r=r)
return ks, pmf
def nhypergeom_cdf_array(M: int, n: int, r: int):
ks, pmf = nhypergeom_pmf_array(M, n, r)
return ks, np.cumsum(pmf)
def nhypergeom_cdf(k, M: int, n: int, r: int):
"""CDF computed by summing the finite PMF (vectorized in k)."""
M, n, r = validate_nhypergeom_params(M, n, r)
k = np.asarray(k)
if r == 0:
return np.where(k < 0, 0.0, 1.0).astype(float)
ks, cdf = nhypergeom_cdf_array(M, n, r)
# For non-integer k, use floor(k).
k_floor = np.floor(k).astype(int)
out = np.zeros_like(k_floor, dtype=float)
out[k_floor >= n] = 1.0
mid = (k_floor >= 0) & (k_floor < n)
out[mid] = cdf[k_floor[mid]]
return out
# Demo
M_demo, n_demo, r_demo = 40, 12, 5
ks_demo, pmf_demo = nhypergeom_pmf_array(M_demo, n_demo, r_demo)
cdf_demo = np.cumsum(pmf_demo)
print("support k:", (ks_demo[0], ks_demo[-1]))
print("sum pmf:", pmf_demo.sum())
print("mean (pmf sum):", (ks_demo * pmf_demo).sum())
print("scipy mean:", stats.nhypergeom.mean(M_demo, n_demo, r_demo))
support k: (0, 12)
sum pmf: 1.0000000000000013
mean (pmf sum): 2.068965517241385
scipy mean: 2.0689655172413794
4) Moments & Properties#
A convenient way to get moments is via an identity with the beta-binomial distribution.
Let \(m = M-n\) be the number of blue items. If you view blue as “successes” and red as “failures”, then drawing until \(r\) blues means you are counting red failures before \(r\) blue successes. One can show:
$$ X \sim \mathrm{BetaBinomial}(N=n,\ \alpha=r,\ \beta=M-n-r+1). $$
This implies clean closed forms for the first four standardized moments.
Let
$$ N=n,\quad \alpha=r,\quad \beta=M-n-r+1,\quad s=\alpha+\beta=M-n+1. $$
Mean#
$$ \mathbb{E}[X] = \frac{N\alpha}{s} = \frac{rn}{M-n+1}. $$
Variance#
$$ \mathrm{Var}(X) = \frac{N\alpha\beta\,(s+N)}{s^2\,(s+1)} = \frac{r\,(M+1)\,n\,(M-n-r+1)}{(M-n+1)^2\,(M-n+2)}. $$
Skewness and excess kurtosis#
Using the beta-binomial formulas, the skewness is
$$ \gamma_1 = \frac{(s+2N)(\beta-\alpha)}{s+2}\,\sqrt{\frac{s+1}{N\alpha\beta\,(s+N)}}. $$
A closed form for excess kurtosis (kurtosis minus 3) is: let \(t=\alpha\beta\); then
$$ \gamma_2 = \frac{s+1}{N\,t\,(s+N)(s+2)(s+3)}\Bigl[s^4 + (6N-1)s^3 + \bigl(6N^2 + 3t(N-2)\bigr)s^2 - 3tN(6-N)\,s - 18tN^2\Bigr] - 3. $$
PGF / MGF / characteristic function#
Because the support is finite (\(0\le X\le n\)), all moments exist and the MGF exists for all real \(t\).
The probability generating function (PGF) can be written via the Gauss hypergeometric function:
$$ G(z)=\mathbb{E}[z^X] = {}_2F_1\left(-N,\;\alpha;\;\alpha+\beta;\;1-z\right). $$
Then
\(M(t)=\mathbb{E}[e^{tX}] = G(e^t)\)
\(\varphi(t)=\mathbb{E}[e^{itX}] = G(e^{it})\)
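Since \(G'(1)=\mathbb{E}[X]\), the PGF can be cross-checked against the mean; a finite-difference sketch using scipy.special.hyp2f1 directly (parameters and step size are illustrative):

```python
# Sketch: G'(1) should equal the mean rn/(M-n+1); central finite difference.
M_, n_, r_ = 40, 12, 5
N_, a_, b_ = n_, r_, M_ - n_ - r_ + 1
G = lambda z: special.hyp2f1(-N_, a_, a_ + b_, 1 - z)
eps = 1e-6
print((G(1 + eps) - G(1 - eps)) / (2 * eps))  # ~ 2.0690
print(r_ * n_ / (M_ - n_ + 1))                # exact mean
```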
Entropy#
There is no simple closed form in general; numerically:
$$ H(X) = -\sum_{k=0}^n p(k)\,\log p(k). $$
def nhypergeom_mean(M: int, n: int, r: int) -> float:
M, n, r = validate_nhypergeom_params(M, n, r)
if r == 0 or n == 0:
return 0.0
return (r * n) / (M - n + 1)
def nhypergeom_var(M: int, n: int, r: int) -> float:
M, n, r = validate_nhypergeom_params(M, n, r)
if r == 0 or n == 0:
return 0.0
denom = (M - n + 1) ** 2 * (M - n + 2)
return (r * (M + 1) * n * (M - n - r + 1)) / denom
def nhypergeom_skewness(M: int, n: int, r: int) -> float:
M, n, r = validate_nhypergeom_params(M, n, r)
v = nhypergeom_var(M, n, r)
if v == 0.0:
return np.nan
N = n
alpha = r
beta = M - n - r + 1
s = alpha + beta
num = (s + 2 * N) * (beta - alpha) * np.sqrt(s + 1)
den = (s + 2) * np.sqrt(N * alpha * beta * (s + N))
return float(num / den)
def nhypergeom_excess_kurtosis(M: int, n: int, r: int) -> float:
M, n, r = validate_nhypergeom_params(M, n, r)
v = nhypergeom_var(M, n, r)
if v == 0.0:
return np.nan
N = n
alpha = r
beta = M - n - r + 1
s = alpha + beta
t = alpha * beta
poly = (
s**4
+ (6 * N - 1) * s**3
+ (6 * N**2 + 3 * t * (N - 2)) * s**2
- 3 * t * N * (6 - N) * s
- 18 * t * N**2
)
return float(((s + 1) / (N * t * (s + N) * (s + 2) * (s + 3))) * poly - 3)
def nhypergeom_pgf(z: complex, M: int, n: int, r: int) -> complex:
"""Probability generating function via the beta-binomial identity."""
M, n, r = validate_nhypergeom_params(M, n, r)
if r == 0:
return 1.0 + 0j
N = n
alpha = r
beta = M - n - r + 1
return complex(special.hyp2f1(-N, alpha, alpha + beta, 1 - z))
def nhypergeom_mgf(t: float, M: int, n: int, r: int) -> float:
return float(np.real(nhypergeom_pgf(np.exp(t), M=M, n=n, r=r)))
def nhypergeom_cf(t: float, M: int, n: int, r: int) -> complex:
return nhypergeom_pgf(np.exp(1j * t), M=M, n=n, r=r)
def nhypergeom_entropy(M: int, n: int, r: int, base=np.e) -> float:
"""Entropy computed from the finite PMF (base=e -> nats, base=2 -> bits)."""
ks, pmf = nhypergeom_pmf_array(M, n, r)
H_nats = -np.sum(xlogy(pmf, pmf))
return float(H_nats / np.log(base))
# Quick sanity check vs SciPy
M, n, r = 40, 12, 5
print("mean (formula):", nhypergeom_mean(M, n, r))
print("var (formula):", nhypergeom_var(M, n, r))
print("skew (formula):", nhypergeom_skewness(M, n, r))
print("kurt_excess (formula):", nhypergeom_excess_kurtosis(M, n, r))
print("entropy (nats, numeric):", nhypergeom_entropy(M, n, r))
print("SciPy (mvsk):", stats.nhypergeom.stats(M, n, r, moments="mvsk"))
print("MGF(0) should be 1:", nhypergeom_mgf(0.0, M, n, r))
mean (formula): 2.0689655172413794
var (formula): 2.3400713436385256
skew (formula): 0.732243001701697
kurt_excess (formula): 0.3979990329137162
entropy (nats, numeric): 1.769665911112108
SciPy (mvsk): (2.0689655172413794, 2.3400713436385256, 0.7322430017017011, 0.39799903291370464)
MGF(0) should be 1: 1.0
5) Parameter Interpretation#
Parameters in the sampling story:
M: total population size.
n: number of red items (often thought of as “good”, “success”, or a target category).
r: how many blue items you require before stopping (a design/stop rule).
The mean \(\mathbb{E}[X] = \frac{rn}{M-n+1}\) makes some qualitative behavior obvious:
Increasing r increases the expected number of reds roughly linearly.
Increasing n (more reds, fewer blues) increases \(\mathbb{E}[X]\) sharply because blues become rarer.
Increasing M while holding n fixed decreases the red fraction and typically decreases \(\mathbb{E}[X]\).
Edge behavior:
If \(r=0\), you stop immediately and \(X=0\).
If \(n=0\), there are no reds and \(X=0\).
If \(r=M-n\) (you require all blues), \(X\) tends to be large and often close to \(n\).
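A quick scan with nhypergeom_mean (defined above) makes these trends and edge cases concrete; the parameter values below are illustrative:

```python
# Sketch: scan E[X] = rn/(M-n+1) over r and over n (M fixed at an illustrative 40).
M_ = 40
print([round(nhypergeom_mean(M_, 12, r_), 3) for r_ in (1, 3, 5, 10)])   # roughly linear in r
print([round(nhypergeom_mean(M_, n_, 5), 3) for n_ in (5, 12, 20, 30)])  # sharp growth in n
print(nhypergeom_mean(M_, 12, 0), nhypergeom_mean(M_, 0, 5))             # degenerate cases -> 0.0
```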
6) Derivations#
6.1 Expectation via symmetry of “gaps”#
Think of a random permutation of the \(M\) items. Let \(m=M-n\) be the number of blue items.
Place the \(m\) blues in the permutation. This creates \(m+1\) gaps of red items:
\(G_0\): # reds before the 1st blue
\(G_1\): # reds between the 1st and 2nd blue
…
\(G_m\): # reds after the last blue
These gaps satisfy \(G_0+\cdots+G_m = n\).
By symmetry (all placements of the \(m\) blue positions are equally likely), the joint distribution of \((G_0,\dots,G_m)\) is exchangeable, hence
$$ \mathbb{E}[G_0]=\cdots=\mathbb{E}[G_m] = \frac{n}{m+1}. $$
If we stop at the \(r\)-th blue, then the number of reds observed is the sum of the first \(r\) gaps:
$$ X = G_0+\cdots+G_{r-1}. $$
Therefore
$$ \mathbb{E}[X] = r\,\frac{n}{m+1} = \frac{rn}{M-n+1}. $$
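A small Monte Carlo sketch of the gap argument (parameters and trial count are illustrative):

```python
# Sketch: reds seen before the r-th blue in random permutations should
# average r*n/(m+1); parameters and trial count are illustrative.
M_, n_, r_ = 20, 8, 3
m_ = M_ - n_
items = np.r_[np.ones(n_, dtype=int), np.zeros(m_, dtype=int)]  # 1 = red, 0 = blue
xs = np.empty(20_000)
for i in range(xs.size):
    perm = rng.permutation(items)
    blue_pos = np.flatnonzero(perm == 0)
    xs[i] = perm[: blue_pos[r_ - 1]].sum()  # reds strictly before the r-th blue
print(xs.mean(), r_ * n_ / (m_ + 1))  # Monte Carlo vs exact, should be close
```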
6.2 Variance via Dirichlet-multinomial / beta-binomial#
The gap vector \((G_0,\dots,G_m)\) is uniform over compositions of \(n\) into \(m+1\) nonnegative integers. This is exactly a Dirichlet-multinomial distribution with concentration parameters all equal to 1.
A key closure property: sums of Dirichlet-multinomial components are beta-binomial. Since \(X\) is the sum of \(r\) gaps, it follows that
$$ X \sim \mathrm{BetaBinomial}(N=n,\ \alpha=r,\ \beta=m-r+1). $$
Plugging into the beta-binomial variance formula yields
$$ \mathrm{Var}(X) = \frac{r\,(M+1)\,n\,(M-n-r+1)}{(M-n+1)^2\,(M-n+2)}. $$
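SciPy ships scipy.stats.betabinom, so this identity can be checked directly; a sketch with illustrative parameters:

```python
# Sketch: NHG(M, n, r) pmf should match BetaBinomial(N=n, alpha=r, beta=M-n-r+1).
M_, n_, r_ = 40, 12, 5
ks = np.arange(n_ + 1)
print(np.allclose(stats.nhypergeom.pmf(ks, M_, n_, r_),
                  stats.betabinom.pmf(ks, n_, r_, M_ - n_ - r_ + 1)))  # expected: True
```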
6.3 Likelihood (i.i.d. replicated experiments)#
If we observe i.i.d. samples \(k_1,\dots,k_T\) from \(\mathrm{NHG}(M,n,r)\) (e.g., repeated experiments on comparable lots), the likelihood is
$$ L(M,n,r;\, k_{1:T}) = \prod_{t=1}^T \Pr(X=k_t\mid M,n,r). $$
Because parameters are integers with constraints, maximum likelihood is typically done via grid search or specialized integer optimization.
def nhypergeom_loglikelihood(data: np.ndarray, M: int, n: int, r: int) -> float:
data = np.asarray(data)
return float(np.sum(nhypergeom_logpmf(data, M=M, n=n, r=r)))
# Example: infer n (reds) with M,r known via grid search
M_true, n_true, r_true = 60, 18, 6
dist_true = stats.nhypergeom(M_true, n_true, r_true)
data = dist_true.rvs(size=500, random_state=rng)
n_min = int(data.max())
n_max = M_true - r_true # because r <= M-n
n_grid = np.arange(n_min, n_max + 1)
ll = np.array([nhypergeom_loglikelihood(data, M_true, n0, r_true) for n0 in n_grid])
n_mle = int(n_grid[np.argmax(ll)])
fig = go.Figure()
fig.add_trace(go.Scatter(x=n_grid, y=ll, mode="lines", name="log-likelihood"))
fig.add_vline(x=n_true, line_dash="dash", line_color="gray", annotation_text="n_true")
fig.add_vline(x=n_mle, line_dash="dot", line_color="black", annotation_text="n_MLE")
fig.update_layout(
title=f"NHG log-likelihood for n (M={M_true}, r={r_true})",
xaxis_title="n (number of reds)",
yaxis_title="log L",
)
fig.show()
print("n_true:", n_true)
print("n_MLE:", n_mle)
n_true: 18
n_MLE: 18
7) Sampling & Simulation (NumPy-only)#
We want to sample \(X\) without calling SciPy.
Algorithm A (order statistic of blue positions)#
A random permutation of the \(M\) items is equivalent to choosing the positions of the blue items uniformly.
Let \(m=M-n\) be the number of blue items and let \(T_r\) be the position (1-indexed) of the \(r\)-th blue in the permutation. Then the number of reds before the \(r\)-th blue is
$$ X = T_r - r. $$
So a simple sampler is:
sample the \(m\) blue positions without replacement,
sort them,
take the \(r\)-th smallest position.
Algorithm B (sequential urn simulation)#
Maintain remaining counts of red and blue. At each draw, draw red with probability
$$ \Pr(\text{red next}) = \frac{n_{\text{rem}}}{n_{\text{rem}} + m_{\text{rem}}}, $$
update counts, and stop when \(r\) blues have been drawn.
Algorithm A is usually faster for moderate \(M\) because it avoids simulating each draw.
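Both reference samplers below draw one value at a time. For many samples, a fully vectorized variant of Algorithm A is possible; a sketch (this helper is an addition, not one of the reference implementations; it assumes \(r\ge 1\) and \(n\ge 1\) and uses O(size × M) memory):

```python
def sample_nhypergeom_vectorized(M, n, r, size, rng):
    """Sketch: vectorized Algorithm A (assumes r >= 1 and n >= 1)."""
    # argsort of i.i.d. uniforms yields one uniform random permutation per row;
    # entries with value >= n are the blue items.
    perm = rng.random((size, M)).argsort(axis=1)
    is_blue = perm >= n
    # 0-indexed position of the r-th blue in each row:
    t0 = (np.cumsum(is_blue, axis=1) == r).argmax(axis=1)
    return t0 - (r - 1)  # number of reds before the r-th blue

x_vec = sample_nhypergeom_vectorized(60, 18, 6, 10_000, rng)
print(x_vec.mean())  # should be near 6*18/43 ≈ 2.512
```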
def sample_nhypergeom_positions(M: int, n: int, r: int, size: int, rng: np.random.Generator):
"""NumPy-only sampler via blue position order statistics."""
M, n, r = validate_nhypergeom_params(M, n, r)
size = int(size)
if size < 0:
raise ValueError("size must be non-negative")
if r == 0 or n == 0:
return np.zeros(size, dtype=int)
m = M - n  # number of blue items
out = np.empty(size, dtype=int)
for i in range(size):
blue_pos = rng.choice(M, size=m, replace=False)
blue_pos.sort()
t0 = blue_pos[r - 1] # 0-indexed position of r-th blue
out[i] = t0 - (r - 1)
return out
def sample_nhypergeom_sequential(M: int, n: int, r: int, size: int, rng: np.random.Generator):
"""NumPy-only sampler via sequential draws without replacement."""
M, n, r = validate_nhypergeom_params(M, n, r)
size = int(size)
if size < 0:
raise ValueError("size must be non-negative")
if r == 0 or n == 0:
return np.zeros(size, dtype=int)
out = np.empty(size, dtype=int)
for i in range(size):
red_rem = n
blue_rem = M - n
red_seen = 0
blue_seen = 0
while blue_seen < r:
p_red = red_rem / (red_rem + blue_rem)
if rng.random() < p_red:
red_rem -= 1
red_seen += 1
else:
blue_rem -= 1
blue_seen += 1
out[i] = red_seen
return out
# Compare samplers + SciPy moments
M, n, r = 60, 18, 6
size = 50_000
x_pos = sample_nhypergeom_positions(M, n, r, size=size, rng=rng)
x_seq = sample_nhypergeom_sequential(M, n, r, size=size, rng=rng)
print("pos sampler mean/var:", x_pos.mean(), x_pos.var())
print("seq sampler mean/var:", x_seq.mean(), x_seq.var())
print("theory mean/var:", nhypergeom_mean(M, n, r), nhypergeom_var(M, n, r))
pos sampler mean/var: 2.5059 3.006525189999999
seq sampler mean/var: 2.5165 2.97880775
theory mean/var: 2.511627906976744 2.9961650031958307
8) Visualization#
We’ll visualize:
the PMF on \(\{0,\dots,n\}\)
the CDF (a step function)
Monte Carlo samples vs the exact PMF
M, n, r = 60, 18, 6
ks, pmf = nhypergeom_pmf_array(M, n, r)
cdf = np.cumsum(pmf)
fig_pmf = go.Figure()
fig_pmf.add_trace(go.Bar(x=ks, y=pmf, name="PMF"))
fig_pmf.update_layout(
title=f"Negative hypergeometric PMF (M={M}, n={n}, r={r})",
xaxis_title="k (reds drawn)",
yaxis_title="P(X=k)",
)
fig_pmf.show()
fig_cdf = go.Figure()
fig_cdf.add_trace(go.Scatter(x=ks, y=cdf, mode="lines", line_shape="hv", name="CDF"))
fig_cdf.update_layout(
title=f"Negative hypergeometric CDF (M={M}, n={n}, r={r})",
xaxis_title="k (reds drawn)",
yaxis_title="P(X≤k)",
)
fig_cdf.show()
mc = sample_nhypergeom_positions(M, n, r, size=150_000, rng=rng)
counts = np.bincount(mc, minlength=n + 1)
pmf_hat = counts / counts.sum()
fig_mc = go.Figure()
fig_mc.add_trace(go.Bar(x=ks, y=pmf_hat, name="Monte Carlo", opacity=0.6))
fig_mc.add_trace(go.Scatter(x=ks, y=pmf, mode="markers+lines", name="Exact PMF"))
fig_mc.update_layout(
title=f"Monte Carlo vs exact PMF (size={mc.size:,})",
xaxis_title="k (reds drawn)",
yaxis_title="Probability",
)
fig_mc.show()
9) SciPy Integration#
SciPy provides scipy.stats.nhypergeom as an rv_discrete distribution.
Common methods:
pmf(k, M, n, r) / logpmf(...)
cdf(k, M, n, r) / sf(...)
rvs(M, n, r, size=..., random_state=...)
stats(M, n, r, moments='mvsk'), entropy(...)
About fit#
rv_discrete objects like nhypergeom do not expose a .fit(...) method.
In SciPy 1.9+, use the generic scipy.stats.fit function, which supports constraints and integer parameters.
dist = stats.nhypergeom
M, n, r = 60, 18, 6
print("pmf(0..5):", dist.pmf(np.arange(6), M, n, r))
print("cdf(0..5):", dist.cdf(np.arange(6), M, n, r))
print("mean/var:", dist.stats(M, n, r, moments="mv"))
samples = dist.rvs(M, n, r, size=10, random_state=rng)
print("rvs:", samples)
# Fitting example: estimate n with M and r fixed (bounds enforce integrality/constraints)
data = dist.rvs(M, n, r, size=2_000, random_state=rng)
fit_res = stats.fit(
dist,
data,
bounds={
"M": (M, M),
"r": (r, r),
"n": (0, M - r),
"loc": (0, 0),
},
)
print(fit_res)
print("n_hat:", int(fit_res.params.n))
pmf(0..5): [0.1048 0.2096 0.2353 0.193 0.1277 0.0715]
cdf(0..5): [0.1048 0.3143 0.5496 0.7426 0.8704 0.9419]
mean/var: (2.511627906976744, 2.9961650031958307)
rvs: [1 4 6 0 1 0 5 1 2 1]
params: FitParams(M=60.0, n=18.0, r=6.0, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
n_hat: 18
10) Statistical Use Cases#
10.1 Hypothesis testing#
If \(M\) and the stopping rule \(r\) are fixed by the design, you can test hypotheses about \(n\) (how many red items are in the population).
Example (one-sided):
\(H_0\): \(n=n_0\) (a claimed red count)
\(H_1\): \(n<n_0\) (fewer reds than claimed)
An observed \(k_\text{obs}\) that is too small is evidence against \(H_0\). A p-value is
$$ p = \Pr(X\le k_\text{obs}\mid M,n_0,r) = F(k_\text{obs}). $$
10.2 Bayesian modeling (posterior on \(n\))#
Because \(n\) is an integer, a practical approach is a discrete prior on \(n\) and an exact posterior by enumeration:
$$ \Pr(n\mid k) \propto \Pr(k\mid n)\,\Pr(n). $$
10.3 Generative modeling#
nhypergeom is a useful component when your data arise from finite-population sequential sampling.
In a generative model you might:
set/learn \(M\) (population size)
model \(n\) (red count) as latent
generate \(X\) from nhypergeom(M, n, r) under a sampling protocol
# 10.1 Hypothesis test: too few reds?
M, r = 60, 6
n0 = 18 # claimed number of reds
k_obs = 6
p_value = stats.nhypergeom.cdf(k_obs, M, n0, r)
print("p-value (P[X<=k_obs] under H0):", p_value)
# 10.2 Bayesian posterior on n (uniform prior over feasible n)
n_grid = np.arange(k_obs, M - r + 1) # must satisfy n >= k_obs and r <= M-n
log_prior = -np.log(n_grid.size) * np.ones_like(n_grid, dtype=float)
log_like = np.array([nhypergeom_logpmf(k_obs, M=M, n=int(n), r=r) for n in n_grid])
log_post_unnorm = log_prior + log_like
logZ = logsumexp(log_post_unnorm)
post = np.exp(log_post_unnorm - logZ)
post_mean_n = float(np.sum(n_grid * post))
cdf_post = np.cumsum(post)
ci_low = int(n_grid[np.searchsorted(cdf_post, 0.05)])
ci_high = int(n_grid[np.searchsorted(cdf_post, 0.95)])
print("posterior mean of n:", post_mean_n)
print("90% credible interval for n:", (ci_low, ci_high))
fig = go.Figure()
fig.add_trace(go.Bar(x=n_grid, y=post, name="posterior"))
fig.update_layout(
title=f"Posterior over n given k_obs={k_obs} (M={M}, r={r})",
xaxis_title="n (reds in population)",
yaxis_title="Posterior probability",
)
fig.show()
# 10.2 Posterior predictive for a future experiment with the same (M,r)
ks = np.arange(0, n_grid.max() + 1)
pmf_mix = np.zeros_like(ks, dtype=float)
for weight, n_val in zip(post, n_grid):
ks_n, pmf_n = nhypergeom_pmf_array(M, int(n_val), r)
pmf_mix[ks_n] += weight * pmf_n
fig_pp = go.Figure()
fig_pp.add_trace(go.Bar(x=ks, y=pmf_mix, name="posterior predictive"))
fig_pp.update_layout(
title="Posterior predictive PMF for X",
xaxis_title="k (reds before r blues)",
yaxis_title="Probability",
)
fig_pp.show()
p-value (P[X<=k_obs] under H0): 0.976727427779557
posterior mean of n: 29.999999999999982
90% credible interval for n: (18, 42)
11) Pitfalls#
Parameterization confusion: some references define the RV as “# failures before \(r\) successes” (a swapped story). Always confirm what \(r\) and \(k\) mean.
Invalid parameters: in SciPy’s definition you must have integers with \(0\le n\le M\) and \(0\le r\le M-n\).
Degenerate cases: \(r=0\) or \(n=0\) collapse the distribution to a point mass at 0.
Numerical stability: factorials / binomial coefficients overflow quickly; prefer logpmf with gammaln and only exponentiate at the end (see the sketch after this list).
Runtime: naive sequential simulation can be slow when the expected stopping time is large; the “blue position” sampler is often faster.
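A minimal sketch of the overflow point (using the already-imported scipy.special):

```python
# Sketch: direct binomial coefficients overflow float64, while log-space stays finite.
print(special.comb(2000, 1000))                           # inf (exceeds float64 range)
print(special.gammaln(2001) - 2 * special.gammaln(1001))  # log C(2000,1000), finite (~1382.3)
```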
12) Summary#
nhypergeom is a discrete distribution for sequential sampling without replacement from a finite population.
In SciPy’s parameterization, it models the number of red items observed by the time the \(r\)-th blue item appears.
The PMF has a clean combinatorial form and is closely related to the hypergeometric distribution.
Mean/variance (and skew/kurtosis) have closed forms via an identity with the beta-binomial distribution.
Sampling can be done NumPy-only via order statistics of blue positions or a sequential urn simulation.
SciPy provides pmf / cdf / rvs / stats / entropy; fitting is done via scipy.stats.fit.
References#
SciPy docstring: scipy.stats.nhypergeom
SciPy docstring: scipy.stats.fit
Wikipedia: “Negative hypergeometric distribution”